Aggregate Memory as an Intermediate Checkpoint Storage Device

Authors

Samer Al-Kiswany

Matei Ripeanu

Sudharshan S. Vazhkudai

Abstract

Applications that generate bursty I/O load, such as checkpointing, require additional support to perform efficiently on next-generation petascale supercomputers. Tens of thousands of processors, generating terabytes of snapshot data at once at each timestep, can easily overwhelm a storage system. Further, even at the current peak I/O bandwidth rates offered by parallel file system deployments at leadership-class facilities, an application is likely to spend a significant portion of its runtime checkpointing. To address these issues, we propose a checkpoint storage device, built from memory resources, that acts as an intermediary to the central parallel file system. Our system comprises a dedicated manager that aggregates memory resources from processors (benefactors) and makes them available as a collective space for checkpointing clients, using a standard POSIX file system interface. We argue that such a system has the potential to alleviate the I/O bandwidth bottleneck for bursty I/O operations like checkpointing by aggregating memory and interprocessor bandwidth.
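The architecture described in the abstract (a manager, memory-donating benefactors, and checkpointing clients) can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions, not the authors' implementation: the class names `Benefactor` and `Manager`, the chunk size, the "most free space first" placement rule, and the `drain` step that stands in for flushing to the parallel file system are all hypothetical, and the POSIX interface and network transport are left out entirely.

```python
import io
from dataclasses import dataclass, field


@dataclass
class Benefactor:
    """A node that donates part of its memory (hypothetical model)."""
    name: str
    capacity: int                                # bytes donated
    chunks: dict = field(default_factory=dict)   # (path, index) -> bytes
    used: int = 0

    def free(self) -> int:
        return self.capacity - self.used


class Manager:
    """Aggregates benefactor memory into one checkpoint space (assumed design)."""

    def __init__(self, benefactors, chunk_size=4):
        self.benefactors = list(benefactors)
        self.chunk_size = chunk_size
        self.catalog = {}                        # path -> benefactor per chunk

    def write(self, path: str, data: bytes) -> None:
        """Stripe a checkpoint across benefactors; all-or-nothing."""
        if path in self.catalog:
            raise FileExistsError(path)
        # Plan the placement first so a failed write leaves no partial state.
        free = {b.name: b.free() for b in self.benefactors}
        plan = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            target = max(self.benefactors, key=lambda b: free[b.name])
            if free[target.name] < len(chunk):
                raise MemoryError("not enough aggregate memory")
            free[target.name] -= len(chunk)
            plan.append((target, chunk))
        # Commit the plan.
        for index, (target, chunk) in enumerate(plan):
            target.chunks[(path, index)] = chunk
            target.used += len(chunk)
        self.catalog[path] = [target for target, _ in plan]

    def read(self, path: str) -> bytes:
        """Reassemble a checkpoint from its chunks, in order."""
        owners = self.catalog[path]
        return b"".join(b.chunks[(path, i)] for i, b in enumerate(owners))

    def drain(self, path: str, sink) -> None:
        """Copy a checkpoint to a file-like sink, then release the memory."""
        sink.write(self.read(path))
        for index, owner in enumerate(self.catalog.pop(path)):
            owner.used -= len(owner.chunks.pop((path, index)))


if __name__ == "__main__":
    nodes = [Benefactor("n0", 8), Benefactor("n1", 8)]
    manager = Manager(nodes)
    manager.write("/ckpt/step1", b"0123456789")
    backing_store = io.BytesIO()                 # stand-in for the file system
    manager.drain("/ckpt/step1", backing_store)
    print(backing_store.getvalue())
```

In this model the client's write returns as soon as the chunks sit in donated memory, and the slower copy to the backing store happens later in `drain`, which is the decoupling an intermediate checkpoint device is meant to provide.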

Similar resources

A hardware MP3 decoder with low precision floating point intermediate storage

The effects of using limited precision floating point for intermediate storage in an embedded MP3 decoder are investigated in this thesis. The advantage of using limited precision is that the values need shorter word lengths and thus a smaller memory for storage. The official reference decoder was modified so that the effects of different word lengths and algorithms could be examined. Finally,...

An Implementation of Using Remote Memory to Checkpoint Processes

Process checkpointing is a procedure that periodically saves process state to stable storage. Most checkpointing facilities select hard disks for archiving. However, the disk seek time is limited by the speed of the read-write heads, thus checkpointing a process to a local disk requires extensive disk bandwidth. In this paper, we propose an approach that exploits the memory on idle work...

Rollback Recovery Scheme for Distributed Shared Memory Clusters

In this paper, a unified lightweight error recovery scheme based on coordinated checkpointing and rollback for distributed shared memory clusters is proposed. The new scheme maintains multiple globally consistent checkpoints of the state of a distributed shared memory cluster and recovers to a pre-fault checkpoint of the system. It also describes and evaluates the coordinated checkpointing. Th...

Managing Checkpoints for Parallel Programs

Checkpointing is a valuable tool for any scheduling system to have. With the ability to checkpoint, schedulers are not locked into a single allocation of resources to jobs, but instead can stop running jobs and re-allocate resources without sacrificing any completed computations. Checkpointing techniques are not new, but they have not been widely available on parallel platforms. We have implem...

A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States

Checkpoint/restart has been an effective mechanism to achieve fault tolerance for many long-running scientific applications. The common approach is to save computation states in memory and secondary storage for execution resumption. However, as the GPU plays a much bigger role in high performance computing, there is no effective checkpoint/restart scheme yet due to the difficulty of the GPU com...

Publication date: 2008
